Declaring character encodings in HTML

Question

How should I declare the encoding of my HTML file?

You should always specify the encoding used for an HTML or XML page. If you don't, you risk that characters in your content are incorrectly interpreted. This is not just an issue of human readability, increasingly machines need to understand your data too. A character encoding declaration is also needed to process non-ASCII characters entered by the user in forms, in URLs generated by scripts, and so forth. This article describes how to do this for an HTML file.

If you need to better understand what characters and character encodings are, see the article Character encodings for beginners. For information about declaring encodings for CSS style sheets, see CSS character encoding declarations.

Quick answer

Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
...
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" 
      content="text/html; charset=utf-8">
...

It doesn't matter which you use, but it's easier to type the first one. It also doesn't matter whether you type UTF-8 or utf-8.

You should always use the UTF-8 character encoding. (Remember that this means you also need to save your content as UTF-8.) See what you should consider if you really cannot use UTF-8.

If you have access to the server settings, you should also consider whether it makes sense to use the HTTP header. Note however that, since the HTTP header has a higher precedence than the in-document meta declarations, content authors should always take into account whether the character encoding is already declared in the HTTP header. If it is, the meta element must be set to declare the same encoding.

You can detect any encodings sent by the HTTP header using the Internationalization Checker.

Details

What about the byte-order mark?

If you have a UTF-8 byte-order mark (BOM) at the start of your file then modern browsers will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.

You could skip the meta encoding declaration if you have a BOM, but we recommend that you keep it, since it helps people looking at the source code to ascertain what the encoding of the page is.

Read more about the byte-order mark.

Should I declare the encoding in the HTTP header?

Use character encoding declarations in HTTP headers if it makes sense, and if you are able, for any type of content, but in conjunction with an in-document declaration.

Content authors should always ensure that HTTP declarations are consistent with the in-document declarations.

Pros and cons of using the HTTP header

One advantage of using the HTTP header is that user agents can find the character encoding information sooner when it is sent in the HTTP header.

On the other hand, there are a number of potential disadvantages:

  • It may be difficult for content authors to change the encoding information for static files on the server – especially when dealing with an ISP. Authors will need knowledge of and access to the server settings.

  • Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.

  • There are potential problems for both static and dynamic documents if they are not read from a server; for example, if they are saved to a location such as a CD or hard disk. In these cases any encoding information from an HTTP header is not available.

    Similarly, if the character encoding is only declared in the HTTP header, this information is no longer available for files during editing, or when they are processed by such things as XSLT or scripts, or when they are sent for translation, etc.

So should I use this method?

If serving files via HTTP from a server, it is never a problem to send information about the character encoding of the document in the HTTP header, as long as that information is correct.

On the other hand, because of the disadvantages listed above we recommend that you should always declare the encoding information inside the document as well. An in-document declaration also helps developers, testers, or translation production managers who want to visually check the encoding of a document.

(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this would usually mean taking action to disable any server defaults.)

Working with polyglot and XML formats

XHTML5: An XHTML5 document is served as XML and has XML syntax. XML parsers do not recognise the encoding declarations in meta elements. They only recognise the XML declaration. Here is an example:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html ....

The XML declaration is only required if the page is not being served as UTF-8 (or UTF-16), but it can be useful to include it so that developers, testers, or translation production managers can visually check the encoding of a document by looking at the source.

Polyglot markup: A page that uses polyglot markup uses a subset of HTML with XML syntax that can be parsed either by an HTML or an XML parser. It is described in Polyglot Markup: A robust profile of the HTML5 vocabulary.

Since a polyglot document must be in UTF-8, you don't need to, and indeed must not, use the XML declaration. On the other hand, if the file is to be read as HTML you will need to declare the encoding using a meta element, the byte-order mark or the HTTP header.

Since a declaration in a meta element will only be recognized by an HTML parser, if you use the approach with the content attribute its value should start with text/html;.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

If you use the meta element with a charset attribute this is not something you need to consider.

Additional information

The information in this section relates to things you should not normally need to know, but which are included here for completeness.

Working with non-UTF-8 encodings

Using UTF-8 not only simplifies authoring of pages, it avoids unexpected results on form submission and URL encodings, which use the document's character encoding by default. If you really can't avoid using a non-UTF-8 character encoding you will need to choose from a limited set of encoding names to ensure maximum interoperability and the longest possible term of readability for your content.

Until recently the IANA registry was the place to find names for encodings. The IANA registry has multiple names for the same encoding, in which case you are supposed to use the name designated as 'preferred'.

Nowadays, you should use the Encoding specification, which provides a list that has been tested against actual browser implementations. You can find the list in the table in the section called Names and labels. It is best to use the names in the left column of that table.

Note, however, that the presence of a name in either of these sources doesn't necessarily mean that it is OK to use that encoding. Several of the encodings are problematic. If you really can't use UTF-8, you should carefully consider the advice in the article Choosing & applying a character encoding.

Do not invent your own encoding names preceded by x-. This is a bad idea since it limits interoperability.

Working with legacy HTML formats

HTML 4.01 doesn't specify the use of the charset attribute with the meta element, but any recent major browser will still detect it and use it, even if the page is declared to be HTML4 rather than HTML5. This section is only relevant if you have some other reason than serving to a browser for conforming to an older format of HTML. It describes any differences from the Details section above.

For pages served as XML, see Working with polyglot and XML formats.

HTML4: As mentioned just above, you need to use the pragma directive for full conformance with HTML 4.01, rather than the charset attribute.

XHTML 1.x served as text/html: Also needs the pragma directive for full conformance with HTML 4.01, rather than the charset attribute. You do not need to use the XML declaration, since the file is being served as HTML.

XHTML 1.x served as XML: Use the encoding attribute of the XML declaration on the first line of the page. Ensure there is nothing before it, including spaces (although a byte-order mark is OK).

The charset attribute on a link

HTML5 deprecated the use of the charset attribute on an a or link element, so you should avoid using it. It originated in the HTML 4.01 specification for use with the a, link and script elements and was supposed to indicate the encoding of the document you are linking to.

It was intended for use on an embedded link element like this:

X Bad code. Don't copy!
See our <a href="/mysite/mydoc.html" charset="iso-8859-15">list of publications</a>.

The idea was that the browser would be able to apply the right encoding to the document it retrieves if no encoding is specified for the document in any other way.

There were always issues with the use of this attribute. Firstly, it is not well supported by major browsers. One reason not to support this attribute is that if browsers do so without special additional rules it would be an XSS attack vector. Secondly, it is hard to ensure that the information is correct at any given time. The author of the document pointed to may well change the encoding of the document without you knowing. If the author still hasn't specified the encoding of their document, you will now be asking the browser to apply an incorrect encoding. And thirdly, it shouldn't be necessary anyway if people follow the guidelines in this article and mark up their documents properly. That is a much better approach.

This way of indicating the encoding of a document has the lowest precedence (ie. if the encoding is declared in any other way, this will be ignored). This means that you couldn't use this to correct incorrect declarations either.

Working with UTF-16

According to the results of a Google sample of several billion pages, less than 0.01% of pages on the Web are encoded in UTF-16. UTF-8 accounted for over 80% of all Web pages, if you include its subset, ASCII, and over 60% if you don't. You are strongly discouraged from using UTF-16 as your page encoding.

If, for some reason, you have no choice, here are some rules for declaring the encoding. They are different from those for other encodings.

The HTML5 specification forbids the use of the meta element to declare UTF-16, because the values must be ASCII-compatible. Instead you should ensure that you always have a byte-order mark at the very start of a UTF-16 encoded file. In effect, this is the in-document declaration.

Furthermore, if your page is encoded as UTF-16, do not declare your file to be "UTF-16BE" or "UTF-16LE", use "UTF-16" only. The byte-order mark at the beginning of your file will indicate whether the encoding scheme is little-endian or big-endian. (This is because content explicitly encoded as, say, UTF-16BE should not use a byte-order mark; but HTML5 requires a byte-order mark for UTF-16 encoded pages.)